As artificial intelligence technology continues to advance, integrating visual and textual data remains a complex challenge. Traditional models often struggle to accurately parse structured visual documents such as tables, charts, infographics, and diagrams. This limitation hampers automated content extraction and comprehension, which in turn affects applications in data analysis, information retrieval, and decision-making. To address this gap, IBM recently released Granite-Vision-3.1-2B, a compact vision-language model designed specifically for document understanding.
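To give a sense of how such a model might be applied to document understanding, here is a minimal sketch using the Hugging Face Transformers library. It assumes the model is published under the repository id `ibm-granite/granite-vision-3.1-2b-preview`, and the image path, prompt text, and generation settings are illustrative placeholders rather than values from the original announcement.

```python
# Minimal sketch: querying a vision-language model about a document image.
# Assumptions: the Hugging Face repo id and the chat-template message format
# shown here; the image path and prompt are hypothetical examples.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "ibm-granite/granite-vision-3.1-2b-preview"  # assumed repo id
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Load a structured document image (e.g., a scanned invoice or report page).
image = Image.open("invoice.png")

# Build a chat-style prompt pairing the image with a question about it.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Summarize the table shown in this document."},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Encode the inputs, generate a response, and decode it back to text.
inputs = processor(images=image, text=prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```

In practice, the same pattern extends to other document-understanding tasks (chart reading, key-value extraction from forms) by changing only the prompt text.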